Table of Contents

Problem Statement

Work with the attached dataset to identify users who are more likely to make a sale.

It's important for the business to understand what makes someone more likely to buy insurance. For this purpose, we assembled a dataset with features of users and their sessions, and an indicator of whether each session resulted in a sale.

Your goal is to work with this dataset and identify users who are more likely to convert so that we can personalise the experience for them.

We don't expect a perfect solution (actually, we believe there isn't one). But please take several hours to implement a well structured approach in a Jupyter notebook.

Afterwards, we'll discuss your solution in detail. Be ready to answer questions like: why did you choose this approach and not others, how will you measure the performance of your solution, and how do you interpret the final results?


Data understanding

| field name | description |
| --- | --- |
| id | unique identifier of the row |
| date | date of the session |
| campaign_id | id of the advertising campaign that led the user to the site |
| group_id | id of the group that led the user to the site |
| age_group | age range of the user |
| gender | gender of the user |
| user_type | internal id of the type of user |
| platform | device type of the user |
| state_id | US state id of the user's location |
| interactions | number of interactions of the user |
| sale | boolean indicator of whether the session resulted in a sale |

Read-in the data

Let's use pandas_profiling for a quick and exhaustive summary of all features and their interactions

Below are some observations extracted from pandas_profiling:

  1. There are 28 missing values in the target sale. We will safely remove those from the dataset prior to any further analysis.
  1. There is a huge imbalance in the target sale - 93.6% of all observations have not resulted in a sale. This imbalance must be addressed before model training and will determine the performance metric we choose.
  1. There is considerable imbalance in the feature platform - approximately 63% of users come from a desktop machine, 36% have missing values and less than 1% come from mobile devices. Depending on whether the missing values represent merely a data quality issue or an actual meaningful signal about user behaviour (e.g. users redirected through third parties have an "undetected" platform), we can either impute the missing values or model them as a separate category. In this submission I have chosen the latter.
  1. All features, except interactions, are categorical. We will have to encode them to be able to use most machine learning methods. In this submission I have tried two methods - one-hot encoding and leave-one-out target coding. They perform similarly, so I leave the latter in the main code.
  1. The feature interactions is heavily skewed. We will log-transform it to make it more Gaussian-like and hence reduce the likelihood that its heavy tails bias our models.
  1. The features group_id and user_type have high cardinalities. If categorical features are one-hot encoded this will be a problem, as it will induce a lot of sparsity in the encoded dataset. Furthermore, features with many categories will bias most learning algorithms in their favour. I have tried one-hot encoding as part of this submission and it only matches the performance of target coding if the high-cardinality features are compacted (e.g. by creating a new category "Other").
  1. The Phi_K correlation plot suggests correlation between sale and the features age_group, gender and state_id. Despite being only visual, this observation is a first indication of what features might be influential in this problem and should be verified.
  1. The date feature has only 9 distinct values and doesn't provide enough information for time series analysis. We will thus drop it.

Clean-up

Let's remove the observations with a missing target sale.

As mentioned before, we will keep the missing values as a special category to be explicitly modelled.

This mostly affects the platform feature, since it has the highest fraction of missing values.

And finally removing date
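The clean-up steps above can be sketched on a toy stand-in for the dataset (column names taken from the data dictionary; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (columns from the data dictionary above)
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "platform": ["desktop", np.nan, "mobile", np.nan],
    "interactions": [3, 1, 7, 2],
    "sale": [0, 1, np.nan, 0],
})

df = df.dropna(subset=["sale"])                    # drop rows with missing target
df["platform"] = df["platform"].fillna("missing")  # keep missingness as its own category
df = df.drop(columns=["date"])                     # only 9 distinct dates -> drop
```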

Train-test split

Before we continue we'll allocate a test set to be used as a proxy for unseen data. This will be the final dataset on which models are compared at the end.

Importantly this must be done before any encoding or rebalancing. We want to keep the data imbalance in the test set and check model performance on it. In this way we'll know if our model has learned useful information to detect the positive signal resulting in a sale.

Any rebalancing or encoding will be learned from the training dataset to ensure no data leakage.

To generate the split and keep the data imbalance, we use StratifiedShuffleSplit and take 20% of the data as test.
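A minimal sketch of the stratified split on synthetic data with the same ~6.4% positive rate (the feature values here are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy features
y = (rng.random(1000) < 0.064).astype(int)      # ~6.4% positives, as in the data

# Stratification preserves the class imbalance in both splits
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))
```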

As we can see the imbalance in sale is very similar across the original dataset, the training dataset and the test dataset.

Data Processing Pipelines

As a last step before modelling, we'll create the encoding and rebalancing logic.

There are many options to encode categorical variables with the most popular one being one-hot encoding. Its alternative name, dummy coding, suggests that it may not be the obvious choice for all problems. Nevertheless I tried it and it offers very similar performance if the high cardinality features are compacted.

Another category of encoding methods is known as target coding, where the main idea is that information about the target (e.g. mean value) conditional on the individual categories is used to encode them. The encoding is typically further combined with the so-called prior, which represents a global statistic unconditioned on any category.
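The blend of category mean and prior can be computed by hand on a toy example (the smoothing weight m is a free choice here; category_encoders implements refined variants of this idea):

```python
import pandas as pd

# Toy data: per-category sale rates differ from the global rate
df = pd.DataFrame({"platform": ["desktop"] * 6 + ["mobile"] * 2,
                   "sale":     [1, 0, 0, 0, 1, 0, 1, 1]})

prior = df["sale"].mean()                        # global sale rate (the "prior")
means = df.groupby("platform")["sale"].mean()    # sale rate conditional on category
counts = df.groupby("platform")["sale"].count()

# Smoothed target encoding: blend the category mean with the prior,
# weighted by category size (m is a hypothetical smoothing factor)
m = 2.0
encoding = (counts * means + m * prior) / (counts + m)
```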

There are several target-coding methods implemented in the Python package category_encoders.

I have tried TargetEncoder and LeaveOneOutEncoder, both with similar performance.

Before we continue, let's define a function to log-transform the interactions feature. It will be used in the final sklearn.pipeline.Pipeline

Log-transform interactions
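A minimal sketch of the transform as a pipeline-compatible step (using log1p rather than log, so that zero interaction counts are handled without producing -inf):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p = log(1 + x): well-defined at x = 0, compresses the heavy right tail
log_interactions = FunctionTransformer(np.log1p, validate=True)

x = np.array([[0.0], [9.0], [99.0]])
x_log = log_interactions.fit_transform(x)
```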

Categorical Encoding

Option 1 - Target coding

Option 2 - One hot encoding

And finally define the encoding pipeline
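A sketch of how such an encoding pipeline can be assembled with a ColumnTransformer (the one-hot variant is shown; the target-coding variant swaps a target encoder into the "cat" slot; column names and toy values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

cat_cols = ["platform", "gender"]   # subset of the categoricals, for illustration
num_cols = ["interactions"]

encoding = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("log", FunctionTransformer(np.log1p), num_cols),
])

X = pd.DataFrame({"platform": ["desktop", "mobile", "missing", "desktop"],
                  "gender": ["m", "f", "f", "m"],
                  "interactions": [0, 3, 12, 7]})
Xt = encoding.fit_transform(X)   # fit on train, then .transform(X_test)
```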

We now learn the encoding from our training dataset and use it to transform the test dataset.

This is what our encoded training dataset looks like:

This is what our encoded test dataset looks like:

Re-balancing sale

Rebalancing the target sale during training is very important if we don't want our models to simply learn to predict the majority class.

While choosing a scoring metric appropriate for imbalanced datasets is certainly important, that alone is often not enough.

For this reason, in addition to choosing an appropriate metric, we will rebalance the target

Because there are plenty of samples, I have opted for downsampling the majority class (no sale). By reducing the size of the dataset, all further analysis downstream (including modelling) will be sped up, which in the context of this submission was an important factor.

There are many ways to down- or under-sample a majority class. By far the most popular first choice is the random under-sampler, which, as the name suggests, randomly selects majority samples until a desired proportion of majority and minority samples is achieved.

The random under-sampler is extremely fast and suitable for mixed data. Caution must be taken to sample sufficiently from the distribution of the majority class, so that its properties are preserved in the undersampled dataset.

I have also tried other popular methods from imblearn, which gave similar results at the cost of extra processing time.

For this reason, the random under-sampler is provided in the main code.

Random undersampler
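The sampling logic can be sketched by hand on synthetic data (this is what imblearn.under_sampling.RandomUnderSampler does with sampling_strategy=1.0; the toy target below is an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 4))
y = (rng.random(5000) < 0.064).astype(int)   # imbalanced toy target

# Keep all minority samples, draw an equal-sized random subset of the majority
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
neg_down = rng.choice(neg, size=len(pos), replace=False)
idx = rng.permutation(np.concatenate([pos, neg_down]))

X_bal, y_bal = X[idx], y[idx]
```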

And this is the rebalanced training data

corresponding to the original:

As we see, the rebalanced training data has an equal proportion of minority and majority samples.


Modelling

Before we begin modelling, we need to define the performance metric to optimize.

The f1 score and balanced accuracy are suitable metrics for imbalanced datasets. Even though we rebalanced our training data, we still want to be sensitive to a classifier making trivial predictions on the (imbalanced) test data.

For this reason the f1 score will be used as the performance metric, with the added convenience that computing it requires the precision and recall of our classifiers.

These are important indicators, specifically when a tradeoff between false positives and false negatives is an integral part of the business question, as it is here.

Essentially, low precision but high recall will allow us to identify interested customers well, at the expense of maybe bothering uninterested customers. In this situation we might even tolerate lower precision, because in a real application the list of potential customers generated by our model will typically be curated to avoid contacting the same person too often. Hence falsely reaching out to an uninterested buyer may be acceptable from a business perspective.

And vice versa: high precision and low recall will ensure that we only spend effort contacting "sure" sales, at the expense of possibly omitting others with buying inclinations.


As a first point, let's calculate the expected f1 score of a dummy classifier that always predicts the minority class on the test set (i.e. recall is 1). Such a classifier does not learn from the training data at all.

This will serve as our lowest possible benchmark. If a model performs worse than that, we will know there have been serious issues during training.
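The benchmark can be computed in closed form: an always-positive classifier has recall 1 and precision equal to the base rate p, so its f1 score is 2p / (p + 1). A sketch on a toy test target with the same imbalance:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_test = (rng.random(2000) < 0.064).astype(int)   # toy imbalanced test target

# Dummy classifier: always predict the minority class "sale"
y_dummy = np.ones_like(y_test)
f1_dummy = f1_score(y_test, y_dummy)

# Closed form: recall = 1, precision = base rate p  =>  f1 = 2p / (p + 1)
p = y_test.mean()
```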

Baseline Models

In addition to the dummy baseline, we'll define two more classical machine learning models to serve as proper baselines: logistic regression and a random forest classifier.

Both serve as a good first choice due to their efficiency, simplicity and interpretability.

The logistic regression will be implemented in two different ways: one with the statsmodels package and the other with sklearn.linear_model. The statsmodels implementation allows for a more convenient specification of the categorical variables (via the patsy package), while the sklearn implementation requires categorical encoding. We will use the raw unencoded training data for the former and the encoded training data for the latter.

Since statsmodels uses one-hot encoding by default, we can thus easily compare that encoding scheme with our target coding.

Statsmodels Logistic Regression

Since statsmodels will use one-hot encoding by default, we need to pay attention to the high-cardinality features group_id and user_type. These will bias the model toward them and also induce a lot of sparsity in the dataset.

Experiments showed that sparseness is an issue, since the MLE estimation fails to converge.

Hence, these will be dropped from the statsmodels implementation.

The logistic regression essentially models the logarithm of the odds ratio as a linear function of the independent variables. This means the coefficients tell us how many units the dependent variable (the log odds ratio) changes per unit change in each independent variable.

It's often, however, easier to work with the odds ratio and not with its logarithm. Hence, below we exponentiate the coefficients to obtain the effects of each feature on the odds ratio, that is the ratio of probability of sale over the probability of no sale.
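A minimal sketch of the formula interface and the exponentiation step (the data here is random and only for illustration; the real model uses the actual training features):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sale": rng.integers(0, 2, 300),
    "age_group": rng.choice(["30-34", "45-49"], 300),
    "gender": rng.choice(["f", "m"], 300),
})

# patsy formula: C() one-hot encodes a categorical, dropping one reference level
model = smf.logit("sale ~ C(age_group) + C(gender)", data=df).fit(disp=0)
odds_ratios = np.exp(model.params)   # exponentiate log-odds coefficients
```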

Let's see only the odds ratios corresponding to categories with considerable statistical significance

Already here we see some interesting insights:

  1. Four features appear to be most prominent - campaign_id, age_group, gender and state_id. Incidentally, three of those we had already hypothesized to be correlated with the target in our visual inspection with pandas_profiling.
  1. The odds ratios increase monotonically with age. The odds of a sale are 2.52 times higher among customers in the 45-49 age group, compared to the reference 30-34 age group, and likewise for the other age groups. This is reasonable, as it could be argued that younger people are less interested in buying insurance than older ones.
  1. The odds of a sale for males are 0.7 times those for females. That is interesting, but without more context it's hard to reality-check. However, it may be related to aggressiveness and risk aversion.
  1. Certain state_ids, namely 24, 25 and 27, almost triple the odds of a sale. That's quite interesting and deserves further investigation with more context.

Logistic Regression

This is the standard implementation with some hyper-parameter tuning.
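A sketch of the tuning setup on synthetic data (the grid over the regularisation strength C is an illustrative assumption; the real run uses the encoded training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)

# Small grid over regularisation strength, scored with f1
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
```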

Random Forest Classifier

The final baseline model is the random forest classifier. It's fast and resilient to overfitting due to its use of bootstrap samples. For this reason, we're not explicitly using cross-validation.
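A sketch on toy data of how the bootstrap samples double as a built-in validation set via the out-of-bag score, which is why explicit cross-validation is skipped:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)   # toy target driven by the first feature

# oob_score evaluates each tree on the samples left out of its bootstrap draw
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
```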

Comparison of Baseline Models

In this section we'll compare the performance of our baselines on the test data and try to figure out what they have learnt from the data.

The comparison will be done along three main directions:

  1. Feature importances
  2. F1 Score
  3. ROC and Precision-Recall curves

Let's visualize the histograms of the predictions. That alone will already give us a good indication of how each baseline has performed.

A few things can be extracted from these diagnostic plots:

  1. The Random Forest classifier has the most difficulty detecting the positive signal in the test data. Even though only a minority of its predictions are in the upper range of the probability distribution (as expected due to the heavily imbalanced test data), there is no clear separation between the two classes. All in all, the majority of the predictions are centered around 0.5, which suggests that the classifier is unsure in most cases.
  1. The two logistic regressions look very similar. Recall that the only substantial difference between the two is how the categorical features are encoded. There is a much clearer separation between the minority class in the upper range of the probability distribution and the rest (corresponding to the majority class of "no sale"). However, the rest is not clearly separated into its own cluster towards the lower range of the probability distribution. Instead there are two peaks, around 0.4 and 0.6. This indicates that the models are unsure as to what really makes a negative sample (no sale) and suggests that we will have a lot of false positives (low precision).

Feature Importance

There are many ways to obtain feature importances. One way that is model agnostic is to compute the permutation feature importances.

Each feature is shuffled several times and the effects on the model metric are averaged, producing an estimate of the average impact that feature has.

This is a great way to capture non-linear dependencies in a model agnostic way.
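A minimal sketch on synthetic data where only the first feature carries signal, so its permutation importance should dominate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)    # only the first feature is informative

model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and average the drop in f1
result = permutation_importance(model, X, y, n_repeats=10,
                                scoring="f1", random_state=0)
```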

For both models we see that the top 3 features that stand out in terms of importance are age_group, state_id and gender.

These are incidentally the features we had already identified from our preliminary descriptive analysis.

F1 Score

The standard f1 score is computed assuming a classification threshold of 0.5

Even though all baselines are doing better than the dummy classifier, the improvement is not considerable.

Let's see how the f1 score depends on the classification threshold:
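The threshold sweep can be sketched like this (toy labels and scores stand in for the real test-set predictions):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.3).astype(int)
# toy scores where positives tend to score higher
y_score = np.clip(0.4 * y_true + 0.8 * rng.random(1000), 0.0, 1.0)

# f1 at each candidate threshold; the default score uses threshold 0.5
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (y_score >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
```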

The solid horizontal black line indicates the f1 score expected by the dummy classifier, while the vertical grey line is the threshold at which the maximum f1 score is achieved by our baselines.

The bottom line here is that the baselines are definitely learning to distinguish the positive signal of a sale and do offer an improvement over an uninformed classifier; however, there are still issues with the precision.

We will see these issues more concretely below

ROC Curve

The ROC curve offers a good way to balance false positives and true positives.

The AUC scores are a further indication that the improvements over the dummy classifier (red line) are not considerable.

Precision-Recall Curves

Below we look at the corresponding tradeoffs between precision and recall, which ultimately determine the f1 score.

The solid circles indicate the points in the curves that achieve the maximum f1 score of all three baseline models.

The optimal point on this curve corresponds to a relatively high recall, but low precision.

We have already seen an indication of that from the distribution plots of the predictions. The models seem to have trouble determining what exactly constitutes a "no sale", while being able to extract useful information on what constitutes a sale.

We are of course interested in predicting sales, so high recall can be seen as beneficial; however, as already mentioned, low precision runs the risk of approaching disinterested customers and alienating them by pressing too aggressively. In a real-life campaign, there will most certainly be post-processing checks (typically called collision management) that ensure the same customer is not approached too often.

Challenger Model

To challenge our simple baselines, we will try two more advanced methods.

One is a gradient boosting method and the other a simple neural network.

We will implement gradient boosting with the catboost package. The package offers a very flexible implementation that encodes categorical variables automatically. It uses different encoding schemes depending on the cardinality of each categorical variable and in practice works well out of the box.

For this reason we'll use the raw categorical features in catboost, but the corresponding code that uses our custom target-coding scheme is still available as an alternative.

Gradient Boosting

Catboost offers an easy way to extract feature importances. The default method, PredictionValuesChange, is similar to permutation feature importance, so we use it.

Unsurprisingly, state_id, age_group and gender are again on the top.

Neural Network

Nothing fancy, just a simple fully connected neural network.
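A minimal sketch of such a network using sklearn's MLPClassifier (the framework choice is interchangeable for an architecture this simple; layer sizes and the toy data are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy linearly separable target

# Two small fully connected hidden layers with ReLU activations
nn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
nn.fit(X, y)
```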

Feature importances for the neural network can be easily obtained with the shap method.

Importantly, shap is not reliable when one-hot encoding is chosen, hence we'll focus only on the case with target coding.

Comparison of Baseline Models and Challengers

Prediction distributions

We'll add the distribution of the predictions on the test dataset from the two challenger models to the ones from the baselines

As indicated by the two more pronounced peaks in the histograms, the challenger models appear to be slightly more discerning about the majority class, while still identifying a small minority. However, the second higher peak suggests that we still have a lot of "no sales" cases, which get disproportionately classified as "sales".

Therefore the challenger models, as well, have difficulty in identifying "no sales", albeit to a lesser extent.

F1 scores

The f1 scores further confirm that the two challenger models do not offer a noticeable improvement over the baselines.

Let's see how the f1 score depends on the classification threshold:

Again, the logistic regression appears to be outperforming all the others.

ROC Curve

Precision-Recall Curves

Again the optimal point on this curve corresponds to a relatively high recall, but low precision.

All models seem to have troubles determining what exactly consitutes a "no sale", while being able to extract useful information on what constitues a sale.

Summary

Below are some summary points to wrap things up.

  1. Predicting the minority class is a hard problem in general, especially in the presence of high target imbalance and high cardinality categorical features.
  1. The presence of mostly categorical features further complicates the issue. Experimenting with different encoding schemes is a major part of data preparation.
  1. All models tend to improve recall at the expense of precision, which might be acceptable from a business perspective.
  1. All models (baseline + challenger) have very similar performance in terms of the f1 score. It's low, but still better than an uninformed model.
  1. Due to the low performance out of the box, a more efficient approach for improvement would be at the data processing stage rather than hyperparameter tuning.
  1. Three features appear to be most informative on predicting a sale across all models - gender, age_group and state_id.
  1. Extracting information about the majority class (no sale) appears to be a challenge. This suggests that the influential features above contain sufficient information about "sale" events, but are ambiguous about "no sale".

    This is akin to a hypothetical situation where a feature, such as first_car=1, is highly predictive of a customer being interested in car insurance (because it's mandated by law), yet first_car=0 alone tells us little about interest in car insurance per se.

  1. Try separate models for different slices of the three most informative features. Sometimes customer behaviour is distinct enough across important dimensions that it's better to model it separately.
  1. Any model that ends up being selected should be calibrated as a last step. Having calibrated predictions is not that important if customers are merely ranked according to their probability of buying insurance, but it is crucial if expected returns are to be factored into the business decision.
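The calibration step mentioned in the last point can be sketched with sklearn's wrapper (the base model and data here are toy stand-ins; isotonic calibration is an alternative to the sigmoid/Platt method shown):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)

# Wrap the selected model; sigmoid (Platt) calibration maps raw scores
# to better-calibrated probabilities via cross-validated fitting
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="sigmoid", cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]
```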